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Abstract. We introduce a signal processing model for signals in non-white noise, 
where the exact noise spectrum is a priori unknown. The model is based on a 
Student's t distribution and constitutes a natural generalization of the widely 
used normal (Gaussian) model. This way, it allows for uncertainty in the noise 
spectrum, or more generally is also able to accommodate outliers (heavy-tailed 
noise) in the data. Examples are given pertaining to data from gravitational wave 
detectors. 
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1. Introduction 

The measurement of gravitational radiation holds great promise for exciting 
astronomical observations [1, 2]. Around the world, efforts are under way to construct 
and improve detectors for gravitational waves; among these are the LIGO detectors 
in the US [3], GEO [4] and Virgo [5] in Europe, and TAMA in Japan [6], some of 
which have already started taking data. Plans for future detectors include advanced 
LIGO [7], LCGT in Japan [8], the Laser Interferometer Space Antenna (LISA) [9] and 
the Einstein Telescope (ET) [10]. As existing instruments are becoming increasingly 
sensitive and future instruments are approaching completion, sophisticated signal 
processing methods are required in order to detect and accurately interpret these 
often weak signals within a noisy environment. For example, signal detection is 
usually implemented via matched filtering [11, 12, 13], while parameter estimation is 
commonly done using Bayesian methods [14, 15] . Many of the data analysis procedures 
employed to date are based on the assumption of the noise being Gaussian with 
a known power spectral density [11, 16]. While these methods have proven to be 
computationally efficient and very powerful at discriminating rare, weak signals within 
noise, some more flexibility or robustness is sometimes desired. One example is the 
analysis of data to be expected from LISA, where the measurement noise originates 
partly from instrumental as well as astrophysical sources, and the noise spectrum it 
only vaguely known a priori [17]. 

We introduce an approach that was developed in the context of the latter case, 
where, along with the signal parameters to be inferred, the noise's power spectrum 
needed to be incorporated as an unknown into the model [18, 19]. The model developed 
here turns out to be computationally convenient, as the additional noise parameters 
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may be analytically integrated out, leading to a Student-i likelihood expression instead 
of the original normal formulation. We expect the same approach to be useful in 
many related signal processing contexts, as it allows to specify the prior information 
about the power spectrum, i.e., its expected magnitude and the associated certainty, 
in a straightforward manner. In this way, it should e.g. also be able to account for 
nonstationarities in the data, in the sense that the power spectrum does not necessarily 
need to be assumed to be exactly the same as at an earlier measurement; other 
fluctuations, like outliers, could also be accommodated in a similar manner. In fact, the 
resulting model actually falls into a class of generalizations to the standard Gaussian 
that have been proposed for their robustness properties: models accommodating 
heavier-tailed noise, or outliers, by either ('directly') implementing a non-Gaussian 
noise model or ('indirectly') substituting the corresponding least-squares procedures 
by less outlier-sensitive methods. Instead of the derivation based on an unknown 
power spectrum, the same model may be motivated by assuming the power spectrum 
itself to be random, so that the resulting noise mixture distribution exhibits a greater 
variability. In the limiting case of decreasing spectrum variability, the model again 
simplifies to the Gaussian. So the model will also be applicable as an ad-hoc alternative 
tunable robust model, with clearly interpretable "tuning parameters" . 

The organization of the paper is as follows. The time series setup is introduced 
in Section 2.1. In the subsequent sections, the probabilistic modelling is described, 
including its time-domain counterpart, the likelihood, prior distributions, posterior 
distribution, marginal likelihood and some implications. Section 3 describes the 
approach using two illustrative examples of simulated time series with and without a 
signal. The discussion section 4 puts the new approach into context. An appendix 
explicating the Discrete Fourier Transform conventions used in this paper is attached. 

2. The time series model 

2.1. The setup 

Consider a time series xi,. . . ,xjv of N real- valued observations sampled at constant 
time intervals At, so that each observation Xi corresponds to time ti = iAt. This set 
of N observations can equivalently be expressed in terms of sinusoids of the Fourier 
frequencies: 

= nuTT- X! "-j cos(27r/jtj) + sin(27r/jtj) (1) 

where the variables Oj and bj each correspond to Fourier frequencies ~ jAf = j^^. 
The summation in (1) runs from j = to j = [iV/2j , with lN/2\ denoting the largest 
integer less than or equal to N/2. By definition, bo is always zero, and 6[Ar/2j is 
zero if N is even; going over from Xi^s to a^'s and bj^s again yields the same number 
(N) of non-zero figures. The set of (iV) frequejicy domain coefficients Oj and bj 
and the time domain observations Xi are related to each other through a discrete 
Fourier transform (and appropriate scaling; see appendix Appendix A. 2). The set of 
trigonometric functions in (1) constitutes an orthonormal basis of the sample space, so 
that there is a unique one-to-one mapping of the observations in time and in frequency 
domain. Instead of the two amplitudes aj and bj, the definition in (1) may equivalently 
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be expressed in terms of a single amplitude and a phase parameter at each frequency: 
= /T77V- sm{27r fjU + LP j), (2) 



where Xj = a /fij + bj and 



arctanf — I if a, > 

ip, = { . ' (3) 

arctani ± tt] if < 0. 



For each Fourier frequency fj , let be the number of Fourier coefficients not being 
zero by definition, i.e.: 

1 if (j = 0) or {N is even and j = , , 

2 otherwise 

so that X]}=o^^ ^ Note that generally one may often simplify to Kj = 2 Vj 
without introducing a noticeable discrepancy, but in the following we will stick to the 
accurate notation of k depending on the index j. 

For a given time series (either in terms of Xi or Oj and bj), we define the functions 
of the Fourier frequencies 

a? + 62 

Piifj) = and (5) 

Kj 

= ^ . ^ . (6) 

Kj- Kj 

for = 0, . . . , [A^/2J , where x denotes the discretely Fourier-transformed time series x, 
as defined in appendix Appendix A. 2. These are the empirical, discrete analogues of 
the one-sided and two-sided spectral power. The set of pi{fj) is also known as the 
pcriodogram of the time series [20]. 

Now consider the case where the observations Xi, and consequently the Oj and 
bj, correspond to random variables Xi, Aj and Bj, respectively. This may mean that 
these are realizations of a random process, where the random variables' probability 
distributions describe the randomness in the observations, or that they are merely 
unknown, where the probability distributions describe a state of information, or it 
may also be a melange of both [2f]. 

The time series has a zero mean if and only if the expectation of all frequency 
domain coefficients is zero as well: 

E[X,] = Vi ^ E[ylj] = E[Bj] = Vj. (7) 

For the probabilistic time series, the spectral power consequently also is a random 
variable {Pi{fj) or P2{fj), respectively). We denote its expectation value by S*{fj); 
if the mean is zero as in (7), it is given by 



(J) Var(A,)+Var(gj) 



It is important to note that while the above expectation is closely related to 
a time series' power spectral density, it is yet quite different. Up to here we have 
considered the discretely Fourier transformed data and some of its basic properties. 
The figure S*{fj) refers to data sampled at a particular resolution and sample size 



Modelling coloured noise. 



4 



and is defined only at a discrete set of Fourier frequencies fj. One miglit want to 
refer to S* as tlie discrctizcd power spectrum. If in fact the data are a realization 
from a stationary random process with (continuous) power spectral density S{f), 
then the expectation S*{fj) is related to S{f) through a convolution, depending on 
the sample size N. Only in the limit of an infinite sample size both of them are 
equal [20, 22]. While both continuous and discretized spectrum are commonly used 
as an approximation to or in place of one another, it is crucial to realize that in the 
following inference will be done with reference to the case of a finite sample size and 
the discretized, convolved spectrum S*{fj). 

A related useful figure in this context is the integrated spectrum with respect to 
some frequency range [/i , /2] [23] , which here we define as 

%./.] = A,f]|5r(/,) (9) 

where the summation is done over the corresponding range of Fourier frequencies 
O'l = min{fc : fk > /i}, j2 = max{fc : fk < /2}, where /2 - /i > A/). The 
integrated spectrum allows to investigate or compare (discrete) spectra independent 
of the particular sample size N. 

2.2. The normal model 

Suppose the moments as defined in (7) and (8) are given. We may set up a 
corresponding normal time series model by assuming all the (frequency domain) 
observables to be stochastically independent, have zero mean and 

YciT{Aj)=a], (10) 

Var(B, ) = (kj - 1) <t| for j = 0, . . . , LiV/2j , (11) 

where 

= ^1* = ^3 Si if,) for J = 0, . . . , LiV/2j . (12) 

While other choices of variance settings matching the assumptions would be possible, 
the assumption of equal variances for Aj and Bj is the only one that makes the 
joint density of (Aj, Bj) a function of overall amplitude Aj alone, independent of the 
phase ipj, and with that leaves the model invariant with respect to time shifts. The 
same model may also be motivated via the maximum entropy principle, as the (zero 
mean, equal variance, independent) normal distribution maximizes the entropy given 
the constraints on the moments given in (7) and (8) above [24]. In that way, the normal 
model constitutes a most convervative model setup under the given assumptions 
[21, 22, 24]. 

The joint normal distribution derived here has actually been commonly used 
before (see e.g. [25]); the normality assumption may not only appear as a natural 
choice, but will also turn out computationally convenient in the following. In the 
context of Fourier domain data in particular, the normality assumption may also be 
motivated via asymptotic arguments, by considering the limit of an infinite observation 
time [26, 27, 28]. The resulting normal model also is exactly the same as the one 
underlying the so-called Whittle likelihood [29, 30], or the one at the basis of matched 
filtering [11, 12, 13]. While intuitively in other contexts it may often be sufficient to 
point out the asymptotic equality S*{fj) « S{fj) for iV — >■ oo, here it is crucial to 
appreciate what exactly the = SKfj) stand for in the case of a finite sample size N. 
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The difference between the exact and approximate ( "Whittle" ) model is explicated in 
more detail in [30]. 



2.3. The corresponding time-domain model 



An immediate consequence of the normality assumption for the frequency-domain 
coefficients {Aj, Bj) is that the time-domain variables (Xi), being linear combinations 
of the frequency-domain variables, also follow a normal distribution. The exact joint 
distribution of the Xi is completely determined by their variance/covariance structure, 
which may be expressed in terms of the autocovariance function. The covariance 
for any pair of time-domain observations and X„ (with m,n G {1, . . . ,iV}, and 
corresponding to times t^ and t„) is given by: 



Cov(X,„,X„) 



1 



lN/2\ 

^ (5*(/,)f cos(27r/,(t„ 

J=0 



t,n)))- (13) 



Note that the autocovariance Cov{Xjn, X^) = 7(in — tj^) depends on t^ and i„ only 
via their time lag i„ — This means that, remarkably though not surprisingly, both 
the invariance requirement or the application of the maximum entropy principle lead 
to a strictly stationary model [31]. Note also that 7(0) = Var(Xi) = I^o./n/,]- 

The autocovariance function allows one to express the distribution of the Xi in 
terms of their (N x N) covariance matrix Ex, which is the symmetric, square Toeplitz 
matrix 

/ 7(0) 7(At) 7(2At) ••• 7(At) \ 
7(A0 7(0) 7(At) 
7(2A0 7(A0 7(0) 



-'X 



7(At) 
7(2A0 
7(3A0 



(14) 



V 7(At) 7(2A0 7(3A,) ••• 7(0) / 

since 7 is periodic such that 7(iAt) — ^{{N + 1 — i)At). The periodic formulation 
in (1) makes the first and last observations Xi and Xn "neighbours", just as Xi 
and X2 are. This may seem odd (depending on the context, of course), but is not 
an unusual problem, as it arises for any conventional spectral analysis of discretely 
sampled data in the form of spectral leakage [32]; it may just not always be as obvious 
in its consequences. The above time domain expression sheds some light on the exact 
shortcomings of the Whittle likelihood approximation; you can see for example that 
it will be a poor approximation in case of predominantly low-frequency noise and a 
small number of observations. The problem may be tackled via windowing of the 
data, or it will also lessen with an increasing sample size; again, the features of this 
approximation are discussed in more detail in [30]. 



2.4- Incorporating an unknown spectrum 

2.4-1- The likelihood function Now suppose the spectrum is unknown and to be 
inferred from the data xi, . . . ,xn- An unknown spectrum here is equivalent to the 
variance parameters ctqi ■ • • t^in/2\ being unknown. Assuming normality of Aj and 
Bj, the likelihood function (as a function of the parameters CTq, . . . , cj^^yjj) is: 

p{xi,...,XN\al,...,af^/2i) 
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lN/2\ 

n 

3=0 



— 1 



'271 a., 



cxp 



/ lN/2\ 

= exp -f log(27r)- ^ 
\ j=0 

( '-^^^^ K 

a exp - T 



log(crj 

l0g(52(/,)) 



2a| 



(15) 
(16) 
(17) 



The term Xj here denotes the jth clement of the Fourier-transformed data vector, 
as defined in appendix Appendix A.l and Appendix A. 2. The above likelihood (17) 
is exactly equivalent to the so-called Whittle likelihood, which is an approximate 
expression for stationary time series [29, 30]. 

If the noise spectrum is a priori known, the "log(5|(/j))" term may also be 
irrelevant (see (29)), and the model again reduces to the normal model described 
e.g. in [11] that is also the basis for matched filtering [12, 13]. In that case, the 
logarithmic likelihood may be conveniently expressed in terms of an inner product of 
vectors, allowing for a geometric interpretation that is expecially useful in the case of 
linear signal models [11, 16, 33]. 



2.4-2. The conjugate prior distribution For the model defined above, the conjugate 
prior distribution for each of the Cqj • ■ • scaled inverse -distribution 

with scale parameter s| and degrees-of-freedom parameter i^j : 



with density function 



(¥) 



i^,/2 



r(i.,/2) 



{a^} exp 



(18) 



(19) 



[34]. The dcgrces-of- freedom here denote the precision in the prior distribution, while 
the scale determines its order of magnitude. For increasing Vj the distribution's 
variance goes towards zero, and for Vj — > the density converges toward the non- 
informative (and improper) distribution with density /(u^) = that is uniform on 

■' "j 

log((T|), and which also constitutes the corresponding Jeffreys prior for this problem 
[35]. If (t| follows an Inv-x^ (vj , s^) distribution, then its expectation and variance are 
given by 



E[^?] 



'^j „2 
1^, -2 



and 



Var((j2) 



and the mean and variance arc finite for i>j > 2 and i/j > 4. respectively [34]. 



(20) 



2.4-3. The posterior distribution Using the conjugate prior, the posterior distribution 
of the for given data again is a scaled inverse x^-distribution: 



{xi, . . . , xn} ^ Inv-x^ (i^j + ^ 



(21) 
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where all the different parameters eorresponding to different frequeneies are mutually 
independent. Comparing prior (18) and posterior (21), one can see that the prior 
distribution might be thought of as providing the information equivalent to Vj 
observations (of coeffieients aj or bj) with average squared deviation s| [34]. 

The use of the conjugate prior distribution leads to a convenient expression for 
the posterior distribution of the variance parameters a^, and with that of the complete 
discrete spectrum. Also, if a'j ~ Inv-x^ {lyj , s'j) , then, since is a scale parameter, 
it follows that the distribution of the two-sided spectrum ^2 (/j ) = ct| / kj simply 
is liiv-x^{vj,s^/Kj). There is a deterministic relationship between given variance 
parameters cr| and the implied autocorrelation function ■y{t) (see (13)), and so for 
random the distribution of the corresponding j(t) may be numerically explored 
via Monte Carlo sampling from the distribution of the a'j. The expectation and 
variance of "f{t) may also be derived analytically: the expected autocovariance (13), 
with respect to the distribution of the fx^, is 

, LJV/2J 

E[7(0] = 1^ E (EK'] f cos(27r/,-t)) , (22) 

which is finite as long as E[(7'j] is finite for all j. Similarly, the variance of the 
autocorrelation is 

Var(7(0) = E (Vai-(^,') ^ cos(27r/,t)2 j , (23) 

which again is finite as long as Var((T|) is finite for all j. 

2.4-4- The marginal likelihood The use of the conjugate Inv-x^ prior (with lyj > 
degrees of freedom) for the variance parameters cr| allows to integrate the unknown 
noise spectrum out of the likelihood expression (17). The marginal likelihood function 
then is 

p{xi, . . . , Xat I 1^0, ■ • ■ , VlN/2i , Sq' • ■ • ' ■5LAr/2J ) 

= n / Pi^,,b,\af)p(a^W„s])da^ (24) 
1=0 ^0 

11 .. .. -j+^j r (^) ^ ' 



(X 



j=0 ^ -^jsj + (aj+fcp y 

n i+^fi^ (26) 



j=0 ^ 3 1 



which constitutes a product of Student-t densities with {uj + — degrees of freedom 
[34] . Using the uninformative, improper Jeffreys prior (with Vj ~ degrees of freedom 
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and density p{<^^) ^) yields the marginal likelihood 



p{xi,...,xn) oc JI (Kj'^IXjf ) = . 



[N/21 



(28) 



One may then also get a mixture of the above expressions ((26) and (28)) in case of 
a prior setting of partly zero and non-zero prior degrees of freedom. 

Note that the corresponding analogue likelihood expression in case of an a priori 
known spectrum (as e.g. in [29, 11, 30], see also (17)) was 



so that the (normal) sum-of-squarcs expression (29) in case of a known spectrum 
generalizes to the Studcnt-t expression (27) once one takes uncertainty about the 
spectrum's scale into consideration. The normal model in turn is the limiting case for 
increasing degrees of freedom. 

2.5. Robust inference via the Student-t model 

The Student-t marginal likelihood expression (26) suggests that considering noise as 
normal but with unknown spectrum is technically equivalent to viewing the noise 
itself as being i-distributed. In that way, the Student-t model may also be useful as 
an alternative for robust modelling, as the wider family of t-distributions includes the 
normal and Cauchy distributions as special or limiting cases. In other contexts the 
t-distribution is commonly used as a generalization of the normal model in order to 
accommodate heavier-tailed errors (see e.g. [36, 37, 38]). In contrast to the above 
derivation of the t-distributed noise based on prior/posterior distributions, the noise 
may also be considered as a scaJe mixture of normal distributions; that is, normal 
noise with a randomly varying scale (variance) [37]. Both relations may actually be 
used to motivate the use of the Student-t distribution for noise exhibiting certain 
nonstationarities (e.g., a noise spectrum that is slightly fluctuating over time), be it 
to model the "variability" or the "uncertainty" in the spectrum, both of which are 
mathematically the same here (except that uncertainty is reduced through learning, 
while the random variation is not). 

The use of heavier-tailed noise models already has repeatedly been advocated in 
the gravitational wave detection context, in order to gain robustness against outliers 
and similar deviations from the commonly assumed normality. Crcighton [39] for 
example suggests the use of a mixture of Gaussian noise and a (uniform) burst 
component in order to account for wide-band noise burst events. Allen et al. [40] 
also illustrate the use of a two-component Gaussian mixture in order to reduce the 
influence of outliers in the data, but more generally they advocate an approach also 
known as M-cstimation, namely the explicit downweighting or ignorance of outlier 
observations falling far into the tail of the noise distribution [41, 42]. Application 
of the above Student-t model may also be considered as a special case of robust M- 
estimation, based on a clearly interpretable noise model and associated parameters 



In case of an a priori known power spectrum, maximization of the normal 
likelihood (29) is the basis for the matched filtering approach commonly used in 
signal detection problems, when looking for signals of parameterized shape in noise 




(29) 



[38] 
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[12, 13, 16]. In Gaussian noise, the likelihood maximization then is equivalent to a 
least-squares approach. A filter based on the Student-i likelihood (26) may therefore 
be useful for cases of an unknown noise spectrum or non-Gaussian noise. Technically, 
likelihood maximization may e.g. work in an iterative manner, by iterating over 
conditional likelihoods (as in [37]). 

Note that since the likelihood (26) implies independent (as opposed to merely 
uncorrelatcd) errors, likelihood maximization here is different from least-squares 
estimation [43, 44, 37], and the likelihood might in fact exhibit multiple modes [45]. 
The (independent) Student-i model not only leads to a less drastic fall-off of the 
likelihood for extreme values, and hence reduced leverage of outliers, but it also 
implies non-spherical density contours for the joint distribution of the noise, effectively 
allowing for a fraction of scrambled individual noise residuals. A similar effect was 
pointed out by Creighton [39] when implementing a robust Gaussian/uniform mixture 
noise model: the approach would allow for excess noise in individual interferometers, 
so that a noise burst would essentially be automatically "vetoed" if it is only measured 
in one of several interferometers. Similarly, the non-spherical density contours of the 
Student-t model make it robust against odd data values at individual frequencies. 

2.6. Defining the prior distribution's parameters 

Depending on the particular application and context in mind, there may be different 
ways to sensibly specifying prior parameters. Firstly, there are the supposedly 
uninformative priors; the Jeffreys prior [35] with = degrees of freedom was already 
mentioned above (see Sec. 2.4.2), and priors that are uniform on aj or on (t| may 
basically be considered as cases of lyj = — 1 and i^j = —2, respectively [46]. Care needs 
to be taken here though, as these priors (with Vj < 0) are improper distributions. 
These may lead to improper marginal likelihoods, as in the case of (28), which does 
not correspond to a normalizable probability distribution for the noise. The resulting 
posterior distribution of signal or noise parameters then may or may not be a proper 
probability distribution [34]. 

If the prior information on the spectral parameters is in the form of measurements 
(or samples) of the spectrum (of same size and resolution), then this may be expressed 
in terms of an equivalent sample size and corresponding prior degrees-of-freedom, as 
suggested in Sec. 2.4.3. Choosing e.g. the i^j to be twice the number of measurements 
of the spectrum taken (i.e., equal to the number of observed Fourier coefficients) would 
then be equivalent to initially assuming an uninformative Jeffreys prior and then using 
the posterior based on the measurements as the prior for the actual (signal) analysis. 
This will of course only make sense if one assumes the spectrum unknown but constant. 

When using the Student-i model as a robust model accommodating for heavy- 
tailed noise, i.e., when the noise itself is assumed to actually follow a i-distribution that 
one can sample from, then one can use sample estimates for the Student-i parameters. 
Moment estimators for and i^j (based on sample variance and kurtosis) are given 
e.g. in [47, 48]. 

When specifying prior parameters that are supposed to reflect information and/or 
variability, it may be helpful to consider the implied moments or quantiles, for 
individual frequency bins or frequency bands. The expressions for the prior's moments 
in (20) may be inverted to 




+ 4 and s| = 




(30) 
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which allows one to specify the scale s| and degrees-of-freedom based on pre-defined 
prior expectation and variance of ct^ , respectively. Note that the degrees-of-freedom 



then are simply a function of the prior variation coefficient y Var(cr|)^E[(T|]. A 

specification of independent of j means a priori white noise, and specifying 
individual Uj for different j indicates varying prior certainty across the spectrum. A 
sensible definition of the prior certainties for the individual spectrum parameters may 
be complicated by the fact that the exact meaning of this discrete set of parameters 
depends on the sample size N. In that case it may be helpful to instead consider 
the integrated spectrum (9) and its a priori properties. The (prior) moments of the 
integrated spectrum are given by 

mi.M] = A, £ f E[a|] = A, f I -^J (31) 

3=Jl 

(if all Vj > 2), and 

Vardi,,,,,) = A?. £ I ( ^^4-4) -^ ^''^ 

j=ji ' w 

(if all Vj > 4). The (prior) variation coefficient for the power within any frequency 
range then is: 



^Var(%j^]) _ \j^3=3l 4 (1^,-2)^(1., -4) 



which simplifies in case all d.f. parameters are taken to be equal (vj = v): 



(33) 



^32 '-j A 



2 

2 "j 



(34) 



and simplifies further in case all scale parameters are taken to be equal (s^ = s ): 



In the general case of (ji > 0, j2 < [A^/2J) this is: 

VVar(%,j,]) _ 1 (36) 



E[% j2]] \ v-A \/j2 - .n + 1 

and similarly, in case {ji — 0, and ^2 — N/2, N even) it is: 
^Var(%j„^^]) ^ r^\f' 



JV-l 

— . (37) 



This may be useful if one wants to specify piecewise constant prior settings with given 
constraints on the overall power per frequency range; this would then lead to the d.f. 
settings 

V = 4+^-.(^BM2]^Y (38) 



or 1/ = 4 



y'Var(I[y:j,j2]) 
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Figure 1. The example data iii in the time domain, and its (empirical) power 
-j^|njP in the Fourier domain. The solid line in the right panel shows the 
theoretical (2-sided) power spectral density in comparison. 



respectively, for the above two cases ((36), (37)). 

For example, if one wants the spectrum to be a priori white with some scale s^, 
such that the marginal prior mean and variation coefficient of the integrated power 
^lfQjN/2] ^ Var(Xj) ^ 7(0) are given by 



E = and ; ; = c, (40) 

then a setting of 



E|2^[/0,/iV/2] 



= and si = 2At'^e (41) 

independent of j will yield an a priori white spectrum with constant expectation and 
variation in the overall variance I[/o./„/,] for any sample size N. Pieeewise constant 
settings across different frequency bands may be implemented analogously. In case all 
Uj > 4, the integrated power, being a sum of random variables with finite mean and 
variance, will be asymptotically normally distributed. 

Instead of using moments, the prior may also be specified in terms of quantiles, 
for example by first deciding on the degrees-of-freedom settings and then aiming the 
prior median at a certain value. The quantiles of an Inv-x^(i', s^) distribution may be 
derived based on the quantiles of a x^-distribution; the p-quantile is then given by 

^.^7x';i-^ (42) 

where xt-i-^ is the (1 — p)-quantile of a x^-distribution with i/ degrees of freedom. 

Finally, in the M-estimation context, the ^-distribution's d.f. parameter may also 
be set based on the shape of the corresponding inHuence function [41, 42]. 

3. Examples 

3.1. Noise only 

Consider a time series n{t) oi N ^ 100 data points sampled at times ti = j^. As 
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Figure 2. Posterior distributions of the 51 spectrum parameters (t| based on tiie 
data shown in figure 1. The left plot shows the posterior corresponding to the 
uninformative (and improper) Jeffreys prior. Posterior expectations do not exist 
in this case. The right plot corresponds to assuming a priori white noise with 
Uj = 3 degrees of freedom for each frequency bin; the dashed line marks the prior 
expectation value. 



a simple example of non-white noise, the data are generated from an autoregressive 
process 

n{U) = fn(t,_i) + a:(t,), (43) 

where the innovations x{ti) are drawn independently from a uniform distribution 
across the interval [— -s/S, a/S] , so that they have zero mean and unit variance. The 
overall variance of the process defined this way is Var(n(ti)) = 2.29; due to the 
positive correlation of subsequent samples it has a higher power at low frequencies 
and less at high frequencies. Figure 1 shows a noise sample generated using the above 
prescription, together with its theoretical power spectral density. 

We will now apply the noise model introduced above and derive the posterior 
distribution of the noise parameters. Assuming one has a rough idea of the noise 
variance, we set the prior scale parameters so that the noise is a priori white 
with a prior expectation of E[Var(n(ii))] = 2.50. Using = 3 degrees of freedom, 
the prior scale and prior expectation then are = 0.0166, and E[cr|] = 0.05. The 
prior variances Var(CT|) do not exist for 3 degrees of freedom. Figure 2 illustrates 
the resulting posterior distribution for all 51 noise parameters (21) in comparison 
to the case of using the uninformative (and improper) Jeffreys prior. The Jeffreys 
prior (with i/j = degrees of freedom for each frequency bin) does not depend on 
the scale parameters s|, and the resulting posterior with < 2 degrees of freedom at 
each frequency does not have finite expectation values. Note also that the posterior 
distributions corresponding to the first and last frequency bin (zero and Nyquist 
frequencies, CTq and ctIq) are wider than the others in both cases, as they have one less 
degree of freedom. 

Figure 3 shows the posterior distributions of autocovariance and variance, which 
are functions of the individual spectrum parameters a^. The variance may either be 
considered as the zero-lag autocovariance 7(0) (see (13)), or as the integrated power 
-^[o,/jv/2] across the whole frequency range (see (9)). 
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Figure 3. Posterior distributions of autocovariancc and variance; the 
distributions shown here were derived via Monte Carlo integration. 



3.2. A signal with additive noise 

In the following example we will consider a time series where the primary interest 
is in a signal component, while the additive noise component will be modelled using 
the approach introduced above. As in the previous section, we will again consider 
N ~ 100 data points y{ti), . . . , j/(tioo) that are modelled as 

= ffJ^a^U) + ns{t{), (44) 

where ngiti) is non- white noise of unknown spectrum, and fffa 4,{^) ^ "chirping" 
signal waveform of increasing frequency: 

//./,a,0(i) = asin(2^(/ + /i)t + 0) (45) 

where / and / are the frequency and frequency derivative, a is the amplitude, and cj) 
is the phase. The noise again is generated the same way as in the previous example, 
by simulating an autoregressive process. 

We define the signal parameters' prior as uniform for phase, frequency and 
amplitude (0 g [0, 27r], / e [1,50] and a G [0,10]) and normal for the frequency 
derivative / (zero mean and standard deviation 5). The prior distribution for the noise 
parameters ctq, . . . ,a|o is set exactly as in the previous section 3.2. For comparison, 
we analyze the data two more times, once assuming the true noise spectrum to be 
known, and once assuming the noise to be white, but with an unknown variance. In 
the latter case, we again assume a (conjugate) Inv-x^ prior with an expectation of 2.5 
and 3 degrees of freedom for the variance parameter. 

The posterior distributions of signal and noise parameters may now be derived 
via Monte Carlo integration; we implemented a Metropolis sampler to simulate draws 
from the joint posterior probability distribution of all parameters [34]. For the model 
including the coloured noise spectrum as unknown, the signal parameters (/, /, a, 
(j)) may be sampled based on the marginal likelihood expression (27), while samples 
of the 51 noise parameters (ctq, . . . ,(j|g) may then be sampled in an additional step 
via the conditional distribution of noise parameters for given signal parameters and 
the corresponding implied vector of noise residuals (21). Similarly, for the known 
spectrum model, sampling of the four signal parameters may be based on the likelihood 
expression (29), while sampling for the white noise model may be done using a 
Gibbs sampler alternately sampling from the conditional distributions of four signal 
parameters and the noise parameter [34]. 
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unknown, coloured 
fixed, coloured 
unknown, white 




Figure 4. Marginal posterior densities of the four individual signal parameters. 
The three densities in each panel result from three different models applied for the 
noise term. The "unknown, coloured" case corresponds to the model introduced 
above, the "fixed, coloured" case assumes the true noise spectrum to be a priori 
knowrn, and in the "unknown white" case the noise is modelled as white v^ith 
unknown variance. The vertical dashed lines indicate the true parameter values. 




Figure 5. Marginal posterior distributions of the noise parameters. The left 
panel shows the posterior distribution for the 51 individual parameters of the 
coloured noise model. The right panel shows the prior and posterior distributions 
of the integrated power in comparison with the corresponding variance parameter 
in the white noise model. 



Figure 4 shows the resulting marginal posterior distributions of the four signal 
parameters in comparison. One can see that application of the more flexible coloured 
noise model for the error term yields a more precise posterior distribution than the 
white noise model, and in fact the posterior is more similar to the case when the 
true noise spectrum is plugged into the model. Looking at the (marginal) posterior 
distributions of the noise parameters in figure 5, one can see that although the noise 
spectrum is only recovered with great uncertainty, this does not seem to harm the 
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estimation of the signal parameters of aetual interest. While the overall variance is 
better recovered by the single-parameter white noise model, the adaptability of the 
coloured noise model to the predominantly low-frequency noise (which is reflected in 
the posterior) seems to be the greater advantage. 

4. Discussion 

This work originated out of the Mock LISA Data Challenges (MLDC) [49], a 
gravitational wave parameter estimation effort in the context of the planned Laser 
Interferometer Space Antenna (LISA). Here parameters of a signal were to be inferred, 
where the signal was buried in noise known to be non-white and interspersed with a 
host of individual emission lines [17]. So the problem was to model a non-white, 
non-continuous spectrum where the shape of the spectrum was only vaguely known 
in advance [18, 19]. Also, the spectrum itself was not of primary interest, but rather 
a nuisance parameter that still needed to be accounted for along with the actual 
signal. Application of the method described here solved the problem and allowed 
the implementation of an MCMC algorithm based on a straightforward generalization 
of the commonly utilized Gaussian noise assumption, where modelling the spectrum 
complicated the analysis only slightly. 

Several approaches to the simultaneous estimation of signal and noise parameters 
have been proposed before. An obvious way to model non-white noise would be to 
assume a particular time series formulation that allows for some flexibility in the 
resulting spectrum, for example an autoregressive (AR) model. Representations of 
this kind have been applied in various signal processing contexts, e.g. for detecting 
sinusoidal signals and estimating their parameters when these are buried in coloured 
noise [50, 51], or in the context of musical pitch estimation, where the signals to 
be modelled again are sinusoidal, but include higher harmonics and time-varying 
amplitudes [52] . If one is only interested in a narrow frequency band of the data, it 
may also be appropriate to model the noise spectrum as constant across the concerned 
range [53]. Regarding data treated in their Fourier domain representation, there is 
a rich literature concerned with the measurement of spectral densities (see e.g. [26] 
and references therein). However, there the aim is usually to produce consistent and 
smooth estimates of the spectrum, and the approaches applied include e.g. averaging 
[54, 32], smoothing via splines [55] or Bayesian model fitting [56, 57]. 

The approach to modelling the noise spectrum introduced here is different in 
that the spectrum per se is not of interest, or only of interest as far as it enters 
into likelihood computations. Smoothness or interpolation therefore are not primarily 
aimed for. What in fact is of concern is properly accounting for an a priori uncertainty 
in the (discretized, convolved) spectrum, in the frequency resolution that is given 
by the numerically Fourier transformed data, which then consequently entails the 
necessity for updating our knowledge about the spectrum as data is being processed. 
The resulting approach generalizes the model underlying the commonly used Whittle 
likelihood [29, 11, 30], and is hence essentially based on the "plain" periodogram 
of the noise time series, which for other purposes is commonly dismissed due to its 
unfavourable large-sample convergence behaviour. The model is very flexible as it is 
built upon a "binned" spectrum estimate without introducing any extra assumptions 
on the shape of the underlying noise spectral density. Its generality and simplicity 
make it useful for modelling residual noise at very little computational cost. By the 
way it is set up, the model is very general; it is a generalization of the model that 
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constitutes the basis for matched Gltcring (e.g. [12, 13]) and that is commonly applied 
in signal processing problems (e.g. [11]), which then in turn constitutes the special case 
of an a priori known spectrum. The additional feature of the approach introduced here 
is that it allows to specify corresponding uncertainties in addition to the (prior) scale 
of the noise spectrum. Marginalization over the uncertainty in the noise spectrum 
yields a Student-t model for the Fourier-domain data as a natural generalization of 
the common normal model. Due to the straightforward intcrpretability of the model 
and its computationally convenient form, we expect it to be particularly useful for 
modelling an unknown noise spectrum that constitutes a nuisance parameter rather 
than being of interest in itself. In that sense, the approach is not primarily aimed at 
gaining information about an unknown spectrum, but rather at properly accounting 
for uncertainty, and avoiding bias from supposedly precise a priori knowledge of the 
spectrum. 

Alternatively, the Student-i model may also be viewed as a generalized, robust 
model accommodating for heavier-tailed, non-Gaussian, or non-stationary noise. 
Related approaches are commonly used in many other applications in the context 
of robust statistics [36, 37, 38, 41, 42]. In fact, the use of similar methods 
for accommodating noise outliers have already been advocated in the context of 
gravitational wave signal processing [39, 40]. Along similar lines, Clark et al. [58] 
implemented a (sine-Gaussian) noise-glitch component into the noise model. Principe 
and Pinto [59] suggested the use of a dictionary of glitch "atoms" in order to account 
for burst-like noise events. Veitch and Vecchio [60] implemented a coherence test based 
on a model selection procedure that is able to discriminate noise glitches (that appear 
independently in the data) from actual astrophysical events (that appear coherently 
in several data streams). Similarly, Littenberg and Cornish [61] extend their model to 
allow for excess noise that is isolated in time and frequency via a wavelet approach. 
Middleton [62] treated the general case of Gaussian noise with superimposed Poisson- 
distributcd impulse noise bursts. We are planning to investigate the performance and 
sensitivity of a Student-t model for robust detection and parameter estimation in real 
interferometer noise 

We expect the approach to be especially useful as a model component properly 
accounting for non-white residual noise, at little computational cost and without 
introducing overly restrictive assumptions about the noise. The interpretability of 
the Student-t model in the context of robust modeling and M-estimation also makes it 
straightforwardly useful for robust inference. In that way, it may particularly be useful 
in signal processing contexts [63, 13], but it should be applicable in any case where a 
model for (residual) noise is needed [23, 64]. The fact that the model represents noise 
properties in terms of its power spectral density and is specified through physically 
meaningful parameters may make it particularly appealing in physical or engineering 
applications, where modelling is commonly based on Fourier-domain descriptions, and 
a time-domain formulation might be hard to incorporate or motivate. 

The above basic model may in future be extended by introducing smoothness 
constraints on the spectrum. This might for example be approached by considering 
correlations between neighbouring spectral bins, or by assuming a piecewise constant 
spectrum, similar to what was done in [53]; the latter approach would in fact be 
a compromise between two models used in the example discussed above (Sec. 3.1), 
namely a fiat spectrum and individually modelled frequency bins. Another interesting 
extension would be the incorporation of cross-spectra [23, 65]. Some of the methods 
described here have been coded as an R software extension package and are available 
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Appendix 

Appendix A.l. Discrete Fourier transform (DFT) 

The Fourier transform convention used in this paper is specified below; it is defined 
for a real- valued function h of time t, sampled at N discrete time points, at a sampling 
rate of and it maps from 

{h{t) e R : t - 0, At, 2At, ...,{N- 1)AJ (A.l) 
to a function of frequency / 

{h{f) e C : / = 0, A^, 2A^, . . . , (iV - 1)A^}, (A.2) 
where Aj = ^^'^ 

N-l 

Hf) = ^/i(jAt)exp(-27rijAJ). (A.3) 

j=o 

The inverse DFT then is given by 



'^(^) ^ ^ E hU^f)cxpi2mjAft) (A.4) 

[22]. 



N 

3=0 



Appendix A.2. Relationship between DFT and time series model 
Let 

aj = Re(/i(/j)) and = Im(/i(/j)), (A.5) 

i.e.: h{fj) = aj + /3ji. For simplicity, in the following N is assumed to be even; for 
uneven N the derivation is similar. The inverse DFT was defined as (A.4): 

N-l 



-f -1 

^ E [(2a,cos(2^/,i) + 2(-/3,)sin(27r/,t)) 



N 



+ W^o + ;^a7V/2 cos(27ri/Ar/2t) (A. 7) 

where t e {0, At, 2At, . . . , (iV - l)At}, and = jAj = are the Fourier 

frequencies. So, comparing to (1), one can see that the realizations of qq, . . . ,a[Ar/2j 
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and 60, . . . , & [Ar/2j are derived from a given time series by Fourier-transforming and 
then setting 

= '^]\[^<^] and hj = -~Kj 
for j = 0, . . . , lN/2\ , which especially imphes that 



N 



N 



Hfi)\ 



(A.9) 
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